Smith School of Business - Session 1 (October 6, 2017)
INSTRUCTOR
Eric Dunford | Ph.D. Candidate | GVPT
edunford@umd.edu
Today we'll cover:
R is a statistical and graphical programming language that is based on a much older language called S. Its source code is written in C, Fortran, and R, and it is completely free under a GNU General Public License.
R offers a powerful way to
R Studio is a graphical user interface (GUI) for the R programming language. The software makes R more user-friendly by adding some point-and-click functionality along with a complete integration of graphs, the data environment, and the coding script.
Think of it like this: R is the engine that runs all our commands, and R Studio is the leather seats and steering wheel. One does the work, the other eases how that work is done.
To install R, download R from CRAN via the following:
To install R Studio, download from the following:
R Studio is broken up into 4 quadrants that can be arranged and customized to the user's preference.
These quadrants are broken up as follows…
The console is where all the action happens. This is “R”.
All commands are processed through the console directly (that is, one can type commands directly into it) or via a script.
A script is a .R text file where we write and run our code.
When we write a line of code, we can run it in the console by highlighting the text and pressing…
command + enter (Mac)
control + enter (Windows)
Everything in a script will be treated as code: if you run it, the line will be processed through the console.
However, we can leave comments and notes to ourselves by commenting out sections of the script using a #
R uses a specific set of rules to govern how it looks up values in the environment.
We manage data by assigning it a name, and referencing that name when we need to use the information again.
Officially, this is called lexical scoping, which comes from the computer science term “lexing”. Lexing is part of the process that converts code represented as text into meaningful pieces that the programming language understands.
In simple terms, an object is a bit of text that represents a specific value.
x <- 3
x
[1] 3
Here we've assigned the value 3 to the letter x. Whenever we type x, R understands that we really mean 3.
There are three standard assignment operators:
<-
=
assign()
“Best practice” is to use the <- assignment operator.
x1 <- 3
x2 = 3
assign("x3",3)
c(x1, x2, x3)
[1] 3 3 3
Note that lexical scoping is flexible. Objects can be written and re-written when necessary.
object <- 5
object
[1] 5
object <- "A Very Vibrant Shade of Purple"
object
[1] "A Very Vibrant Shade of Purple"
One can see all the objects in the environment by either looking at the user interface in RStudio (specifically, the environment tab)… or by typing ls() in the console.
ls()
[1] "object" "x" "x1" "x2" "x3"
Once assigned, an object has a class. A class describes the properties of the data type or data structure assigned to an object.
We can use the function class() to find out what kind of data type or structure our object is.
class(x)
[1] "numeric"
The object x is of class numeric, i.e. a number.
There are many classes that an object can take.
obj1 <- "This is a sentence"
obj2 <- TRUE
obj3 <- factor("This is a sentence")
c(class(obj1),class(obj2),class(obj3))
[1] "character" "logical" "factor"
Understanding what class of object one is dealing with is important — as it will determine what kind of manipulations one can do or what functions an object will work with.
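As a small illustration of why class matters, the sketch below tries the same addition on a numeric value and on a character value. The object names num_val, chr_val, and result are made up for this example.

```r
num_val <- 3      # numeric
chr_val <- "3"    # character: looks like a number, but is stored as text

num_val + 1       # arithmetic works on numerics

# The same operation on a character is an error, so we wrap it in
# tryCatch() to capture the error message instead of stopping the script.
result <- tryCatch(chr_val + 1,
                   error = function(e) conditionMessage(e))
result
```

Checking class() first is usually the quickest way to diagnose this kind of error.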
As noted, there are many different data types in R. We will primarily run into the following types:
| Type | Example |
|---|---|
| Integer | 7 |
| Numeric | 4.56 |
| Character | “Hello!” |
| Logical | TRUE |
| Factor | "cat" (1) |
When need be, an object can be coerced to be a different class.
x
[1] 3
as.character(x)
[1] "3"
Here we converted the value stored in x (the number 3) into a character string with the text “3”. Note that x itself only changes if we assign the result back to it.
We often want to get rid of objects after creating them. To delete (or drop) an object from the environment, use the function rm(), which stands for “remove”.
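A short sketch of coercion in practice: the coerced value is only kept if we assign it, and not every coercion succeeds. The names x_chr and bad are illustrative.

```r
x <- 3
x_chr <- as.character(x)   # store the coerced value under a new name

class(x)       # x itself is unchanged
class(x_chr)   # the new object is a character

# Coercion can fail: text that is not a number becomes NA (with a warning,
# which we silence here to keep the output tidy).
bad <- suppressWarnings(as.numeric("three"))
is.na(bad)
```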
ls()
[1] "obj1" "obj2" "obj3" "object" "x" "x1" "x2" "x3"
rm(x,x1,x2,x3)
ls()
[1] "obj1" "obj2" "obj3" "object"
We can also remove all objects from the environment at once by typing the following command.
rm(list = ls(all.names = TRUE))
Or we can do so from R Studio by clicking on the broom icon.
Objects offer a way to reference different data. This means that we can work with many different data types simultaneously.
This makes it easier to:
Note that the only way to hold onto information is to assign it as an object! Otherwise the information is printed but instantly forgotten by R.
A function is a type of object in R that can perform a specific task. Unlike objects that hold data, functions take arguments and return the output of some manipulation.
A function is specified first with the object name and then parentheses. For example, the function log() calculates the natural log of any number placed inside the parentheses.
log(4)
[1] 1.386294
Functions operate in the background.
There are a number of functions in R, known as base functions, that are always available as soon as you start R.
When we need to do things that are not a part of the base functionality, we can import new functions by installing packages. But more on this later.
We've already come across a few functions, and we'll learn a lot more moving forward. Just keep in mind that whenever a name is followed by parentheses (), it's a function.
Here are examples of a few common base functions that we'll see.
| Function | Description |
|---|---|
| c() | links entries together as a vector |
| as.character() | coerces the input to be a character class |
| length() | reports how “long” a vector or data frame is |
| dim() | reports the dimensions of a data frame |
| class() | reports the class of an object |
All functions in R contain rich documentation regarding how a function works, the inputs it requires, and example code. We can access this documentation by using ? in front of the function.
?c()
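Besides ?, the base function args() prints a function's arguments directly in the console. A minimal sketch using log(), whose base argument defaults to e:

```r
args(log)                 # shows the arguments: x and base

log(8)                    # natural log (the default base is e)
log(8, base = 2)          # naming an argument changes the behaviour
log(x = 100, base = 10)   # every argument can be named explicitly
```

Naming arguments makes a call easier to read and means their order no longer matters.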
There are also many ways data can be organized in R.
The same object can be organized in different ways depending on the needs of the user. Some commonly used data structures include:
vector
matrix
data.frame
list
array
X <- c(1, 2, 4, 5, 44, 6, 10)
X
[1] 1 2 4 5 44 6 10
class(X)
[1] "numeric"
length(X)
[1] 7
data.frame(X)
X
1 1
2 2
3 4
4 5
5 44
6 6
7 10
matrix(X)
[,1]
[1,] 1
[2,] 2
[3,] 4
[4,] 5
[5,] 44
[6,] 6
[7,] 10
list(X)
[[1]]
[1] 1 2 4 5 44 6 10
array(X,dim = c(2,2,2))
, , 1
[,1] [,2]
[1,] 1 4
[2,] 2 5
, , 2
[,1] [,2]
[1,] 44 10
[2,] 6 1
There are many ways to organize the same piece of information in R, and different data structures afford us different advantages and bring with them different limitations.
Throughout this short course, data frames will be the dominant data structure that we use; however, as you become more acquainted with R, you'll see and use other types of data structures more often.
One must understand the structure of an object in order to systematically access the material contained within it.
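To make this concrete, here is a sketch that pulls the same value (the fifth entry, 44) out of four of the structures built from X above. The object names vec, mat, df, and lst are invented for this example.

```r
X <- c(1, 2, 4, 5, 44, 6, 10)

vec <- X
vec[5]            # vector: a single index

mat <- matrix(X)
mat[5, 1]         # matrix: row and column

df <- data.frame(X)
df$X[5]           # data frame: column by name, then position

lst <- list(X)
lst[[1]][5]       # list: double brackets pick the element, then index it
```

The value is identical each time; only the route to it depends on the structure.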
Let's use a dataset inherent to R called cars. There are a number of datasets that are built into R. These are for demonstration purposes.
Note that these data will not appear in the environment until we assign them to an object.
data <- cars
class(data)
[1] "data.frame"
An easy way to see what's inside a data object is to just print() it. R prints objects automatically in the console.
data
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
7 10 18
8 10 26
9 10 34
10 11 17
11 11 28
12 12 14
13 12 20
14 12 24
15 12 28
16 13 26
17 13 34
18 13 34
19 13 46
20 14 26
21 14 36
22 14 60
23 14 80
24 15 20
25 15 26
26 15 54
27 16 32
28 16 40
29 17 32
30 17 40
31 17 50
32 18 42
33 18 56
34 18 76
35 18 84
36 19 36
37 19 46
38 19 68
39 20 32
40 20 48
41 20 52
42 20 56
43 20 64
44 22 66
45 23 54
46 24 70
47 24 92
48 24 93
49 24 120
50 25 85
We can look at the structure of a data object by using the str() function.
str(data)
'data.frame': 50 obs. of 2 variables:
$ speed: num 4 4 7 7 8 9 10 10 10 11 ...
$ dist : num 2 10 4 22 16 10 18 26 34 17 ...
Or grab the variable names using the colnames() function.
colnames(data)
[1] "speed" "dist"
We can leverage what we know about the dimensionality of the data to extract parts of it.
We do this by using brackets [] alongside the data object. We can then access the dimensions in the data by specifying the row and column.
data[row,column]
The function dim() can tell us about the dimensions of a data object.
dim(data)
[1] 50 2
We now know that the object data has 50 rows and 2 columns.
data[,2] # Access the entire 2nd column
[1] 2 10 4 22 16 10 18 26 34 17 28 14 20 24 28 26 34
[18] 34 46 26 36 60 80 20 26 54 32 40 32 40 50 42 56 76
[35] 84 36 46 68 32 48 52 56 64 66 54 70 92 93 120 85
data[49,] # Access just the 49th row
speed dist
49 24 120
data[1,2] # Access just a cell
[1] 2
The key is to keep in mind the dimensions. We can't access data that isn't there.
data[51,]
speed dist
NA NA NA
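The bracket method also accepts ranges, column names, and negative indices. A brief sketch on the same cars data (first_three is a name made up for this example):

```r
data <- cars              # built-in dataset used above

data[1:3, ]               # rows 1 to 3, all columns
data[c(1, 50), "dist"]    # rows 1 and 50 of the column named "dist"
data[-(1:48), ]           # a negative index drops rows: only rows 49 and 50 remain

first_three <- data[1:3, ]   # assign the subset to keep it
```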
Most data objects can be accessed using the $ operator.
$ acts as a key by which we can extract a specific variable or data feature.
If we hit Tab after specifying the $ after our data object, R Studio will offer a list of all available variables.
Here we call the speed variable from our dataset.
data$speed
[1] 4 4 7 7 8 9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14
[24] 15 15 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24
[47] 24 24 24 25
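Once extracted with $, a variable behaves like any other vector, and $ can also be used to add a new variable. In this sketch the column name ratio is invented for illustration.

```r
data <- cars

mean(data$speed)                       # use the extracted column in a function

# $ can also create a column: stopping distance per unit of speed
data$ratio <- data$dist / data$speed
head(data, 3)
```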
There are many functions designed to help us understand the dimensions of a data structure.
dim(data) # Dimensions
[1] 50 2
nrow(data) # Number of Rows
[1] 50
ncol(data) # Number of Columns
[1] 2
There are also some useful functions built into R to view portions of a data structure.
head(data,3) # Reports the 3 first entries
speed dist
1 4 2
2 4 10
3 7 4
tail(data,3) # Reports the 3 last entries
speed dist
48 24 93
49 24 120
50 25 85
summary() lets us quickly summarize the distributions across a set of variables.
summary(data)
speed dist
Min. : 4.0 Min. : 2.00
1st Qu.:12.0 1st Qu.: 26.00
Median :15.0 Median : 36.00
Mean :15.4 Mean : 42.98
3rd Qu.:19.0 3rd Qu.: 56.00
Max. :25.0 Max. :120.00
There are a number of packages that are supplied with the R distribution. These are known as “base packages” and they are in the background the second one starts a session in R.
A package is a set of functions and programs that perform specific tasks. By installing packages, we introduce new forms of functionality to the R environment.
To use the content in a package, one first needs to install it. One can do this by utilizing the following function: install.packages(). By inserting the name of a specific package, we can connect to an R “mirror” and download the binary of the package.
install.packages("ggplot2")
The version of that package is then saved on your computer and can be called at any time (on or offline).
Once installed, it's on the system for good. You can then reference or load the package any time you wish to use a function from it.
There are two functions we can use to load a package: library() and require().
library(ggplot2)
# or
require(ggplot2)
You must load the package before you can use any function in it.
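Two related tools are worth knowing, sketched below. The package::function form calls a single function from an installed package without attaching the whole package, and requireNamespace() checks whether a package is installed without stopping the script. The package name "notARealPackage" is deliberately fake.

```r
# Call one function without attaching its package: package::function
stats::median(c(1, 3, 7, 100))

# requireNamespace() returns TRUE or FALSE instead of stopping with an error.
has_pkg <- requireNamespace("notARealPackage", quietly = TRUE)
has_pkg

if (!has_pkg) {
  message("Package not installed; run install.packages() first.")
}
```

In the same spirit, library() stops with an error when a package is missing, while require() only returns FALSE with a warning.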
R Studio also offers us a way to install packages through the interface.
If we click on the Packages tab and then click Install, we can download a package by typing its name.
We can then load the package from R Studio by clicking the check box beside the package's name.
Sometimes one has a lot of packages running simultaneously.
No problem: we can see what packages are up and running by typing sessionInfo() into the console.
This will tell us everything about the version of R and the packages we are using to run our analysis.
sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X El Capitan 10.11.3
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] knitr_1.15
loaded via a namespace (and not attached):
[1] magrittr_1.5 tools_3.3.2 stringi_1.1.2 stringr_1.2.0 evaluate_0.10
If you ever try to run a function and you get the following prompt…
Error: could not find function "qplot"
It's likely you forgot to load the package.
require(ggplot2) # First Load the package
qplot() # Then run the function
# Voilà!
R allows you to import a large variety of datasets into the environment. However, R's base packages only support a few file formats.
No fear: there is almost always an external package that can do the job!
We are going to focus on three packages to import different data types:
readr: an expansive array of functions to read different data types
readxl: for Excel spreadsheets
haven: for SPSS, SAS, and .dta
First, we need to install these packages onto our computer.
install.packages("readr")
install.packages("readxl")
install.packages("haven")
And then load them into our current R Session.
require(readr)
require(readxl)
require(haven)
R doesn't intuitively know where your data is. If the data is in a special folder entitled “my_data”, we have to tell R how to get there.
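We can do this three ways:
setting the working directory with setwd()
setting the working directory through R Studio
pointing directly to the data with its full path
Every time R boots up, it does so in the same place, unless we tell it to go somewhere else.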
We can find out which directory we are in by using the getwd() function.
getwd() # Get the current working directory
/Users/edunford/
We can then set a new working directory by passing the path to the folder we want to work in, as a string, to the function setwd().
setwd("/Users/edunford/Desktop/my_data")
getwd()
/Users/edunford/Desktop/my_data/
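Since the paths above only exist on the instructor's machine, here is a sketch that works anywhere. It uses tempdir(), the scratch folder R creates for each session, and old_wd is a name made up for this example.

```r
old_wd <- getwd()          # remember where we started

setwd(tempdir())           # move into the session's temporary folder
getwd()

setwd(old_wd)              # go back so later code is unaffected
identical(getwd(), old_wd)
```

Saving the old directory before changing it is a handy habit whenever a script moves around the file system.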
R Studio also makes setting the working directory really easy.
Click: Session → Set Working Directory → Choose Directory...
This will allow you to set the working directory quickly. The downside is that you have to do it manually every time you return to this project. By writing a script for everything you do, it is easier to replicate (and for others to replicate) your work.
Finally, we can also just point directly to the data by outlining the specific path.
Here we are assigning a string containing our path to the object path.
path <- "~/Desktop/my_data/data.csv"
We then load the data using that path.
data <- read.csv(path)
Here we will review how to import five separate data types:
.dta: STATA file format
.csv: comma separated file format
.sav: SPSS file format
.xlsx: standard Excel file format
.Rdata: R's file format
For all versions of STATA
require(haven)
data <- read_dta(file = "data.dta")
Other packages:
readstata13
foreign
read.csv() and read.table() are both base functions in R.
data <- read.csv(file = "data.csv",
stringsAsFactors = F)
# Or
data <- read.table(file = "data.csv",
header = T,
sep=",",
stringsAsFactors = F)
These functions have specific arguments that we are referencing:
stringsAsFactors = F means that we don't want character vectors in the data.frame to be converted to factors.
header = T means the first row of the data contains the column names.
sep = "," means that entries are separated by commas.
The readr package provides a much simpler approach.
require(readr)
data <- read_csv("data.csv")
With read_csv(), characters aren't converted to factors.
For SPSS and SAS file formats, the haven package offers a simple way of reading in data.
require(haven)
data <- read_sav(file = "data.sav") # SPSS
require(readxl)
data <- read_excel("data.xlsx")
We can even select specific sheets.
excel_sheets("data.xlsx") # list avail. sheets
[1] "Sheet1" "Sheet2"
data <- read_excel("data.xlsx",
sheet = 'Sheet1')
.Rdata is the data format native to R. It saves and loads objects.
load(file='data.Rdata')
There is also a point-and-click option for importing and exporting data in R.
If we go into the Environment tab and then click Import Dataset, R Studio will guide us through the import.
Exporting data is the same process in reverse. Instead of reading the data, we want to write a new version of it.
There are a series of functions (each provided by their respective packages) that allow us to do just that.
Each requires that you input the data you're looking to export and specify the file name and path to tell the computer where the file is going.
write_dta(data, "data.dta") # haven: STATA
write_csv(data, "data.csv") # readr: comma separated
write_sav(data, "data.sav") # haven: SPSS
write_sas(data, "data.sas") # haven: SAS
write_tsv(data, "data.tab") # readr: tab separated
# etc.
.Rdata offers two options to save data. We can either save a single data object or save the entire workspace.
# Save just an object
save(data, file="data.Rdata")
# Save the entire workspace
save.image(file="workspace.Rdata")
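The import and export steps can be tried end to end without touching any project files. The sketch below is a round trip under a few assumptions: it uses the built-in cars data, the base functions write.csv() and read.csv() rather than the readr versions, and throwaway paths from tempfile(). The names csv_path, rdata_path, and data_csv are invented for this example.

```r
data <- cars

# tempfile() gives a throwaway path so nothing is written to your project.
csv_path   <- tempfile(fileext = ".csv")
rdata_path <- tempfile(fileext = ".Rdata")

write.csv(data, file = csv_path, row.names = FALSE)   # base R export
data_csv <- read.csv(csv_path)                         # and import
dim(data_csv)

save(data, file = rdata_path)   # save the object "data"
rm(data)                        # remove it from the environment
load(rdata_path)                # load() restores it under its original name
exists("data")
```

Note that load() does not need an assignment: the object comes back with the name it was saved under.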
You'll find an annotated script walking through object creation and data importation in R.
objects_and_importing_data.R
Broadly speaking, R functions as a general calculator that can process a variety of data types.
As we can see, most operators in R are the usual suspects, but some forms are particular to R.
| Operation | Calc | Out |
|---|---|---|
| Addition | 3 + 4 | 7 |
| Subtraction | 3 - 4 | -1 |
| Multiplication | 3 * 4 | 12 |
| Division | 3 / 4 | 0.75 |
| Exponentiation | 3 ^ 4 | 81 |
In the example, we'll walk through a few more operators.
There are a range of functions designed to ease mathematical calculations. Some of these functions calculate specific values, such as the natural log or powers of Euler's number (\( e^a \)).
log(4)
[1] 1.386294
exp(5)
[1] 148.4132
Others can be used to find the sum, the mean, or the median of a numerical vector.
x <- c(1,3,7,100)
sum(x)
[1] 111
mean(x)
[1] 27.75
median(x)
[1] 5
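Real data often contains missing values, which R stores as NA. As a hedged aside, the sketch below shows how these functions react and how the na.rm argument handles it. The vector y is made up for this example.

```r
y <- c(1, 3, NA, 100)    # NA marks a missing value

mean(y)                  # returns NA: the missing value propagates
mean(y, na.rm = TRUE)    # drop missing values before computing
sum(y, na.rm = TRUE)
```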
Boolean statements (i.e. true/false statements) are central to any computer programming environment. Boolean statements allow us to make quick conditional evaluations, which are key to subsetting data.
The following outlines the various types of boolean statements available.
x == y # equal to
x != y # does not equal
x >= y # greater than or equal to
x <= y # less than or equal to
x > y # greater than
x < y # less than
Statements can be combined using "and" (&) or "or" (|) operators to make more specific queries.
x==1 & y==5 # "and" conditional statements
x==1 | y==5 # "or" conditional statements
Boolean statements can be fed directly into data objects via the brackets method []. This offers a powerful and simple way to subset data.
x <- c(1,33,100,.6,5,77)
x
[1] 1.0 33.0 100.0 0.6 5.0 77.0
x[x > 30]
[1] 33 100 77
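The same idea carries over to data frames: a boolean statement placed in the row position keeps only the rows where it is TRUE. A sketch using the built-in cars data, with the names fast and fast_short invented for this example:

```r
data <- cars

fast <- data[data$speed > 20, ]                          # rows where speed exceeds 20
fast_short <- data[data$speed > 20 & data$dist < 60, ]   # both conditions must hold

nrow(fast)
fast_short
```

Remember the comma after the condition: it tells R that we are filtering rows and keeping all columns.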
There are also a number of base functions that provide useful boolean evaluations. Here are just a few examples…
is.character("hello") # for class
[1] TRUE
all(c(T,F,F)) # are all entries True?
[1] FALSE
identical(1+1,2) # are these entries the same?
[1] TRUE
Finally, boolean statements have a nice property in R. If we convert a boolean statement to a numeric class, TRUE values convert to 1 and FALSE values convert to 0.
This offers us a quick way of generating dichotomous values.
x <- 1:10
x >= 5
[1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
as.numeric(x >= 5)
[1] 0 0 0 0 1 1 1 1 1 1
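Because TRUE counts as 1 and FALSE as 0, functions such as sum() and mean() can be applied to a boolean statement directly. A minimal sketch:

```r
x <- 1:10

sum(x >= 5)     # how many entries satisfy the condition
mean(x >= 5)    # what share of entries satisfy it
```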
We often must deal with problematic text data. Sometimes we need to format responses from a survey so that we can use them in some analysis; other times we are just trying to calculate the date.
Data is often riddled with errors and issues that are costly to resolve. In a sense, this data is dirty. We can't run analysis on dirty data.
A regular expression is a special text string for describing a search pattern. We can extract, clean, and manipulate text using these expressions, which can save hours of manual data cleaning.
Consider the following string vector…
countries <- c("Canada","Russia","New Zealand","New Guinea")
Say, from this vector, we want to identify which entries contain the word “New”. The grep() function searches for a specific pattern and returns the positions of the strings that match it.
grep(pattern = "New", countries)
[1] 3 4
Here it returns positions 3 and 4, which correspond to the positions of the matching entries in the vector.
We can use that position to draw out specific content.
position <- grep(pattern = "New", countries)
countries[position]
[1] "New Zealand" "New Guinea"
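A close relative of grep() is grepl(), which returns TRUE or FALSE for every entry rather than positions. The sketch below also shows that patterns are case-sensitive unless we say otherwise. The name has_new is made up for this example.

```r
countries <- c("Canada", "Russia", "New Zealand", "New Guinea")

has_new <- grepl(pattern = "New", countries)   # one TRUE/FALSE per entry
has_new
countries[has_new]

# Patterns are case-sensitive by default; ignore.case = TRUE relaxes that.
grep(pattern = "new", countries)
grep(pattern = "new", countries, ignore.case = TRUE)
```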
This feature can be useful to identify relevant content in a variable or body of text.
gsub() can help us actually manipulate the content in a string by identifying a pattern and then replacing it with something new.
countries
[1] "Canada" "Russia" "New Zealand" "New Guinea"
gsub(pattern = "New",replacement = "Old",countries)
[1] "Canada" "Russia" "Old Zealand" "Old Guinea"
We can also manipulate cases with the tolower() and toupper() functions.
string <- "This Is ReAlLY imPORtant."
tolower(string)
[1] "this is really important."
toupper(string)
[1] "THIS IS REALLY IMPORTANT."
We can also get rid of excessive spaces using the trimws() function.
sent <- " This sentence has a ton of white space "
sent
[1] " This sentence has a ton of white space "
trimws(sent)
[1] "This sentence has a ton of white space"
There are generic ways to draw out specific kinds of content from a string: such as digits or punctuation. There are many different types of regular expressions, and we don't have time to review all of them here, but here are a few useful ones.
"\\w" → word character (letter, digit, or underscore)
"\\d" → digit
"\\s" → space character
"*" → zero or more of the preceding item
"+" → one or more of the preceding item
"[]" → match any one character listed inside the brackets
Here let's remove problems from the following string using gsub().
trouble <- "This ::String is a 2Problem; 56"
trouble <- gsub("[::]","",trouble)
trouble
[1] "This String is a 2Problem; 56"
trouble <- gsub("\\d*","",trouble)
trouble
[1] "This String is a Problem; "
trouble <- gsub("[;]",".",trouble)
trimws(trouble)
[1] "This String is a Problem."
We can also target all punctuation with the "[[:punct:]]" regular expression.
dirt <- "C^lean%% this $%&*_@string((!"
dirt
[1] "C^lean%% this $%&*_@string((!"
gsub("[[:punct:]]","",dirt)
[1] "Clean this string"
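Regular expressions can extract content as well as remove it. As a sketch, the base functions gregexpr() and regmatches() pull the numbers out of the earlier example string. The names matches and digits are invented here.

```r
trouble <- "This ::String is a 2Problem; 56"

# gregexpr() finds every match of the pattern; regmatches() pulls them out.
matches <- regmatches(trouble, gregexpr("[0-9]+", trouble))
digits <- matches[[1]]    # regmatches() returns a list with one entry per input string
digits
as.numeric(digits)        # coerce the extracted text to numbers
```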
We can also join or paste text using R. To do so, we'll use the paste() function, which takes the strings to join and a specified separator.
sent1 <- "It is nice outside."
sent2 <- "I'll go for a walk."
paste(sent1,sent2,sep = " ")
[1] "It is nice outside. I'll go for a walk."
paste(sent1,sent2,sep = "::::")
[1] "It is nice outside.::::I'll go for a walk."
We can also use paste() to collapse a string vector down into a single line. We do this by specifying the collapse= argument, which is like sep= in that it tells paste() what to place between the collapsed entries.
countries
[1] "Canada" "Russia" "New Zealand" "New Guinea"
paste(countries,collapse=", ")
[1] "Canada, Russia, New Zealand, New Guinea"
collapse= can be used with paste() in useful ways.
sent1 <- "These are the countries that matter:"
countries_sent <- paste(countries,
collapse=", ")
paste(sent1, countries_sent,sep=" ")
[1] "These are the countries that matter: Canada, Russia, New Zealand, New Guinea"
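Two further behaviours of paste() are sketched below: it works element by element on vectors, and its sibling paste0() joins text with no separator at all. The object name labels is made up for this example.

```r
countries <- c("Canada", "Russia", "New Zealand", "New Guinea")

labels <- paste("Country", 1:4, sep = "_")   # the shorter input is recycled
labels

paste0(countries, "!")                       # paste0() is paste() with sep = ""
```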
R has a specific Date class. We will use the function as.Date() to coerce a relevant string into a date class.
str <- "2006-04-30"
class(str)
[1] "character"
date_str <- as.Date(str)
class(date_str)
[1] "Date"
Objects of class Date have some nice properties that make analysis and manipulation easy.
date_str
[1] "2006-04-30"
date_str + 30 # date in 30 days
[1] "2006-05-30"
date_str - 3000 # date 3000 days ago
[1] "1998-02-11"
This also allows us to look at the distance between two dates.
date1
[1] "2015-06-07"
date2
[1] "2013-02-14"
date1-date2
Time difference of 843 days
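For completeness, a sketch that creates the two dates shown above and works with the result. The name gap is invented for this example.

```r
date1 <- as.Date("2015-06-07")
date2 <- as.Date("2013-02-14")

gap <- date1 - date2
as.numeric(gap)                            # drop the units to get a plain number
difftime(date1, date2, units = "weeks")    # or ask for different units
```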
That said, dates come in many different formats. To let R know that a specific string is a date, we have to tell it the date format.
example <- "February 3, 1987"
as.Date(example)
Error in charToDate(x) :
character string is not in a standard unambiguous format
as.Date(example, format = "%B %d, %Y")
[1] "1987-02-03"
Formatting dates is similar to regular expressions in that it has a special syntax. In a string (i.e. using “ ”), we specify the exact pattern of the date with all appropriate punctuation and spacing.
The following are the main expressions used in formatting.
%d → day as a number
%a → abbreviated weekday
%A → unabbreviated weekday
%m → month as number
%b → abbreviated month
%B → unabbreviated month
%y → 2 digit year
%Y → 4 digit year
as.Date("Friday March 13, 2009","%A %B %d, %Y")
[1] "2009-03-13"
as.Date("11/13/14","%m/%d/%y")
[1] "2014-11-13"
as.Date("7th of May 2000","%dth of %B %Y")
[1] "2000-05-07"
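The same expressions work in the other direction: format() turns a Date back into text in whatever layout we ask for. A short sketch using only the numeric codes, since month and weekday names depend on the computer's language settings. The name d is made up for this example.

```r
d <- as.Date("2009-03-13")

format(d, "%m/%d/%y")          # month/day/2 digit year
format(d, "%Y")                # just the 4 digit year
as.numeric(format(d, "%m"))    # month as a number we can compute with
```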
Open up an R Studio session and open operators_and_cleaning_text.R.
Here we'll review some of the textual manipulations we learned and we'll explore the powerful stringr package for text manipulation.
With our general understanding of R, we'll cover a comprehensive logic for data manipulation and graphics.
The goal is to leave with a thorough tool kit for data analytics in R.
Finally, we'll cover basic statistical models in R.
Please contact me if you have any questions in the meantime. Thanks!
Eric Dunford | Ph.D. Candidate
Department of Government and Politics
University of Maryland, College Park
edunford@umd.edu
www.ericdunford.com